
Add PostgreSQL support and restore related configurations#59

Merged
romeokienzler merged 7 commits into main from add_postgres_example
Apr 24, 2026

Conversation

@romeokienzler
Collaborator

This pull request introduces a plugin-based architecture for Optuna coordinator (storage backend) selection in the iterate2 module, enabling flexible, extensible support for storage backends such as SQLite, JournalFS, and PostgreSQL. It also adds first-class support for PostgreSQL as a coordinator backend, improves HPO configuration ergonomics, and provides an example script and dependency management for PostgreSQL-based workflows.

Key changes include:

1. Coordinator Plugin System and PostgreSQL Support

  • Refactored storage backend selection in iterate2 to use a new plugin registry (CoordinatorPlugin), allowing easy addition of new coordinator backends. (terratorch_iterate/iterate2/plugin/coordinator/__init__.py, terratorch_iterate/iterate2/_iterate2.py, terratorch_iterate/iterate2/__init__.py, terratorch_iterate/iterate2/plugin/__init__.py)
  • Implemented and auto-registered coordinator plugins for SQLite, JournalFS, and PostgreSQL, with robust matching and configuration logic. (terratorch_iterate/iterate2/plugin/coordinator/sqlite.py, journalfs.py, postgresql.py)
  • Added dependency management for PostgreSQL coordinator via a new postgresql extra in pyproject.toml, using psycopg2-binary.
  • Added a new example script for running HPO trials with LSF and PostgreSQL as the Optuna backend. (examples/run_lsf_gridfm_example_postgres.sh)
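The registry-plus-matching pattern described above can be sketched roughly as follows. All names here (CoordinatorPlugin, register, select_coordinator) and the URL-prefix matching rule are illustrative assumptions, not the actual terratorch-iterate API:

```python
# Hypothetical sketch of a coordinator plugin registry: plugins self-register
# via a decorator, and the coordinator is chosen by matching the storage URL.
_REGISTRY: list[type["CoordinatorPlugin"]] = []


def register(cls):
    """Class decorator that auto-registers a coordinator plugin."""
    _REGISTRY.append(cls)
    return cls


class CoordinatorPlugin:
    @classmethod
    def matches(cls, url: str) -> bool:
        raise NotImplementedError

    @classmethod
    def storage_url(cls, url: str) -> str:
        return url  # default: pass the URL through to Optuna unchanged


@register
class SQLitePlugin(CoordinatorPlugin):
    @classmethod
    def matches(cls, url: str) -> bool:
        return url.startswith("sqlite://")


@register
class PostgreSQLPlugin(CoordinatorPlugin):
    @classmethod
    def matches(cls, url: str) -> bool:
        return url.startswith("postgresql://")


def select_coordinator(url: str) -> type[CoordinatorPlugin]:
    """Return the first registered plugin whose matcher accepts the URL."""
    for plugin in _REGISTRY:
        if plugin.matches(url):
            return plugin
    raise ValueError(f"no coordinator plugin matches {url!r}")
```

With this shape, adding a new backend (e.g. JournalFS) is just another decorated class; no changes to the selection logic are needed.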

2. HPO Configuration Improvements

  • Consolidated HPO resource selection into a new compute group hyperparameter in gridfm_graphkit_hpo.yaml, which jointly selects gpu_num, num_workers, and batch_size for better scaling and usability.
  • Updated metrics to use "Validation loss" as the primary tracked metric.
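A grouped hyperparameter like this typically replaces three independent search dimensions with one categorical choice, so incompatible combinations (e.g. a large batch on a single GPU) are never sampled. The shape below is illustrative only; the group name and member values are assumptions, not the actual gridfm_graphkit_hpo.yaml contents:

```yaml
# Hypothetical sketch: one "compute" choice co-selects all three resources.
compute:
  type: categorical
  choices:
    - {gpu_num: 1, num_workers: 4, batch_size: 32}
    - {gpu_num: 2, num_workers: 8, batch_size: 64}
    - {gpu_num: 4, num_workers: 16, batch_size: 128}
```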

3. Reliability and Usability Enhancements

  • Added logic to automatically re-queue a portion of failed trials (25%) for retry, improving robustness of HPO runs. (terratorch_iterate/iterate2/_iterate2.py)
  • Improved logging and error handling throughout the coordinator selection and storage initialization process.
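The retry heuristic in the first bullet can be sketched as a small selection function. The policy shown (take the first quarter of failed trials' parameter sets, rounding up) is an assumption; the real logic in _iterate2.py may order or filter trials differently:

```python
import math


def trials_to_requeue(failed_params: list[dict], fraction: float = 0.25) -> list[dict]:
    """Pick a fraction of failed trials' parameter sets to re-queue for retry.

    failed_params: parameter dicts of trials that ended in a FAIL state.
    Returns the subset to retry; rounds up so at least one failed trial
    is retried whenever any exist.
    """
    n = math.ceil(len(failed_params) * fraction)
    return failed_params[:n]
```

With Optuna, each returned parameter dict could then be handed to `study.enqueue_trial()`, which queues a trial with fixed parameters for the next worker to pick up.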

These changes collectively make the HPO workflow more robust, scalable, and easier to configure for a variety of cluster and database setups.

@romeokienzler romeokienzler merged commit 36e2ce3 into main Apr 24, 2026
0 of 3 checks passed
